This analysis examines the Wisconsin Breast Cancer dataset, a widely used medical resource for breast cancer detection and diagnosis. The dataset, compiled by Dr. William H. Wolberg, contains measurements of breast tumour cell nuclei. The main goal is to explore and understand the dataset using machine learning. Pre-processing covers data splitting and the handling of missing values and outliers. We then carry out extensive Exploratory Data Analysis (EDA), including feature engineering and statistical tests; apply unsupervised machine learning (clustering and dimensionality reduction) to reveal latent patterns; and finally use supervised machine learning for classification and regression. Throughout, the analysis is strengthened by informative visualizations, detailed evaluations, reflections on the methodology used, and attention to the ethical concerns of healthcare data analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import IsolationForest
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error
from scipy.stats import shapiro, mannwhitneyu
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv("MS4S16_Dataset.csv")
# Splitting the dataset into Training Set and Test Set
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
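The split above is purely random, so the benign/malignant ratio can drift between the two sets. A minimal sketch of a stratified alternative, using a small synthetic frame (the real `MS4S16_Dataset.csv` is not loaded here), shows how `stratify` preserves the class balance exactly:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real dataset (hypothetical values).
df = pd.DataFrame({
    "diagnosis": ["M"] * 40 + ["B"] * 60,
    "radius_mean": range(100),
})

# stratify keeps the M/B ratio identical in the train and test splits.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["diagnosis"]
)
print(train_df["diagnosis"].value_counts(normalize=True).to_dict())
```

With 40% malignant cases overall, both splits retain exactly that proportion, which matters for evaluation on a moderately imbalanced medical dataset.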
# checking null values
data.isnull().sum()
id                          3
diagnosis                   3
radius_mean                 5
texture_mean                6
perimeter_mean              4
area_mean                   5
smoothness_mean             3
compactness_mean            4
concavity_mean              4
concave points_mean         8
symmetry_mean               3
fractal_dimension_mean      4
radius_se                   6
texture_se                  8
perimeter_se                3
area_se                     6
smoothness_se               6
compactness_se              7
concavity_se                8
concave points_se           9
symmetry_se                 8
fractal_dimension_se        7
radius_worst               13
texture_worst              21
perimeter_worst             6
area_worst                  4
smoothness_worst            9
compactness_worst           4
concavity_worst             3
concave points_worst        6
symmetry_worst              4
fractal_dimension_worst    13
dtype: int64
# Handle missing values
numeric_columns = train_data.select_dtypes(include=[np.number]).columns
imputer = SimpleImputer(strategy="mean")
train_data[numeric_columns] = imputer.fit_transform(train_data[numeric_columns])
test_data[numeric_columns] = imputer.transform(test_data[numeric_columns])
# Handle duplicated values
train_data = train_data.drop_duplicates(keep="first")
data.describe()
| id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5.680000e+02 | 566.000000 | 565.000000 | 567.000000 | 566.000000 | 568.000000 | 567.000000 | 567.000000 | 563.000000 | 568.000000 | ... | 558.000000 | 550.000000 | 565.000000 | 567.000000 | 562.000000 | 567.000000 | 568.000000 | 565.000000 | 567.000000 | 558.000000 |
| mean | 3.011402e+07 | 14.103267 | -241.973664 | 91.949048 | 654.942403 | 0.096312 | 0.104333 | 0.088712 | -3.500369 | 0.187402 | ... | 16.269794 | 25.735691 | 110.948035 | 897.936508 | 0.132469 | 0.254412 | 0.272125 | 0.114470 | 0.290327 | 0.084020 |
| std | 1.250894e+08 | 3.517424 | 445.216862 | 24.358029 | 352.555899 | 0.014178 | 0.052878 | 0.079739 | 59.492306 | 0.115008 | ... | 4.842370 | 6.123776 | 59.245691 | 688.231051 | 0.022865 | 0.157582 | 0.208867 | 0.065854 | 0.061907 | 0.018171 |
| min | 8.670000e+03 | 6.981000 | -999.000000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | -999.000000 | 0.000700 | ... | 7.930000 | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 |
| 25% | 8.690778e+05 | 11.692500 | -999.000000 | 75.190000 | 420.300000 | 0.086290 | 0.064710 | 0.029520 | 0.019885 | 0.161900 | ... | 13.015000 | 21.222500 | 84.160000 | 515.550000 | 0.116850 | 0.147450 | 0.114475 | 0.064930 | 0.250450 | 0.071318 |
| 50% | 9.060010e+05 | 13.320000 | 17.000000 | 86.240000 | 548.750000 | 0.095895 | 0.092630 | 0.061540 | 0.033340 | 0.179200 | ... | 14.965000 | 25.455000 | 97.820000 | 686.600000 | 0.131350 | 0.214100 | 0.227450 | 0.099930 | 0.282600 | 0.079960 |
| 75% | 8.812852e+06 | 15.780000 | 21.010000 | 104.200000 | 787.050000 | 0.105325 | 0.130400 | 0.130000 | 0.073520 | 0.195700 | ... | 18.782500 | 29.705000 | 126.900000 | 1091.500000 | 0.146000 | 0.339500 | 0.383500 | 0.161300 | 0.318550 | 0.092088 |
| max | 9.113205e+08 | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 2.100000 | ... | 36.040000 | 49.540000 | 910.000000 | 10056.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 |
8 rows × 31 columns
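The summary above shows minima of -999 for `texture_mean`, `concave points_mean`, and `fractal_dimension_mean` (pulling their means to implausible negative values), which looks like a sentinel code for missing data rather than a real measurement. A minimal sketch, assuming -999 is indeed such a code, of converting it to NaN so that the mean imputation above is not distorted (toy values, not the real column):

```python
import numpy as np
import pandas as pd

# Small frame with the suspected sentinel (hypothetical values).
df = pd.DataFrame({"texture_mean": [10.4, -999.0, 21.3, -999.0, 17.0]})

# Treat -999 as missing, then impute with the mean of the remaining values.
df = df.replace(-999.0, np.nan)
df["texture_mean"] = df["texture_mean"].fillna(df["texture_mean"].mean())
print(df["texture_mean"].tolist())
```

Run before `SimpleImputer`, this would restore sensible summary statistics for the affected columns.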
# Handle outliers
outlier_detector = IsolationForest(contamination=0.05, random_state=42)
train_data["outlier"] = outlier_detector.fit_predict(train_data.drop(["diagnosis", "id"], axis=1))
train_data = train_data[train_data["outlier"] != -1].drop("outlier", axis=1)
# Read the dataset from 'MS4S16_Dataset.csv'
data = pd.read_csv('MS4S16_Dataset.csv')
# Check data types of each column
for column in data.columns:
    if data[column].dtype == 'object':
        # For string columns, fill NaN values with the mode (most frequent value)
        data[column] = data[column].fillna(data[column].mode()[0])
    else:
        # For numeric columns, fill NaN values with the column mean
        data[column] = data[column].fillna(data[column].mean())
# Display the DataFrame after handling NaN values
print("DataFrame after handling NaN values:")
print(data)
DataFrame after handling NaN values:
id diagnosis radius_mean texture_mean perimeter_mean \
0 842302.0 M 17.99 10.38 122.80
1 842517.0 M 20.57 17.77 132.90
2 84300903.0 M 19.69 21.25 130.00
3 84348301.0 M 11.42 20.38 77.58
4 84358402.0 M 20.29 14.34 135.10
.. ... ... ... ... ...
566 926682.0 M 20.13 28.25 131.20
567 926954.0 M 16.60 28.08 108.30
568 927241.0 M 20.60 29.33 140.10
569 92751.0 B 7.76 24.54 47.92
570 92751.0 B 7.76 24.54 47.92
area_mean smoothness_mean compactness_mean concavity_mean \
0 1001.0 0.11840 0.27760 0.30010
1 1326.0 0.08474 0.07864 0.08690
2 1203.0 0.10960 0.15990 0.19740
3 386.1 0.14250 0.28390 0.24140
4 1297.0 0.10030 0.13280 0.19800
.. ... ... ... ...
566 1261.0 0.09780 0.10340 0.14400
567 858.1 0.08455 0.10230 0.09251
568 1265.0 0.11780 0.27700 0.35140
569 181.0 0.05263 0.04362 0.00000
570 181.0 0.05263 0.04362 0.00000
concave points_mean ... radius_worst texture_worst perimeter_worst \
0 0.14710 ... 25.380 17.33 184.60
1 0.07017 ... 24.990 23.41 158.80
2 0.12790 ... 23.570 25.53 152.50
3 0.10520 ... 14.910 26.50 98.87
4 0.10430 ... 22.540 16.67 152.20
.. ... ... ... ... ...
566 0.09791 ... 23.690 38.25 155.00
567 0.05302 ... 18.980 34.12 126.70
568 0.15200 ... 25.740 39.42 184.60
569 0.00000 ... 9.456 30.37 59.16
570 0.00000 ... 9.456 30.37 59.16
area_worst smoothness_worst compactness_worst concavity_worst \
0 2019.0 0.16220 0.66560 0.7119
1 1956.0 0.12380 0.18660 0.2416
2 1709.0 0.14440 0.42450 0.4504
3 567.7 0.20980 0.86630 0.6869
4 1575.0 0.13740 0.20500 0.4000
.. ... ... ... ...
566 1731.0 0.11660 0.19220 0.3215
567 1124.0 0.11390 0.30940 0.3403
568 1821.0 0.16500 0.86810 0.9387
569 268.6 0.08996 0.06444 0.0000
570 268.6 0.08996 0.06444 0.0000
concave points_worst symmetry_worst fractal_dimension_worst
0 0.2654 0.4601 0.11890
1 0.1860 0.2750 0.08902
2 0.2430 0.3613 0.08758
3 0.2575 0.6638 0.17300
4 0.1625 0.2364 0.07678
.. ... ... ...
566 0.1628 0.2572 0.06637
567 0.1418 0.2218 0.07820
568 0.2650 0.4087 0.12400
569 0.0000 0.2871 0.07039
570 0.0000 0.2871 0.07039
[571 rows x 32 columns]
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| id | 571.0 | 3.011402e+07 | 1.247598e+08 | 8.670000e+03 | 869161.000000 | 906290.000000 | 8.836916e+06 | 9.113205e+08 |
| radius_mean | 571.0 | 1.410327e+01 | 3.501963e+00 | 6.981000e+00 | 11.705000 | 13.380000 | 1.576500e+01 | 2.811000e+01 |
| texture_mean | 571.0 | -2.419737e+02 | 4.428674e+02 | -9.990000e+02 | -999.000000 | 16.950000 | 2.099500e+01 | 3.928000e+01 |
| perimeter_mean | 571.0 | 9.194905e+01 | 2.427241e+01 | 4.379000e+01 | 75.235000 | 86.490000 | 1.039500e+02 | 1.885000e+02 |
| area_mean | 571.0 | 6.549424e+02 | 3.510062e+02 | 1.435000e+02 | 420.400000 | 552.400000 | 7.826500e+02 | 2.501000e+03 |
| smoothness_mean | 571.0 | 9.631188e-02 | 1.414045e-02 | 5.263000e-02 | 0.086390 | 0.095940 | 1.053000e-01 | 1.634000e-01 |
| compactness_mean | 571.0 | 1.043333e-01 | 5.269249e-02 | 1.938000e-02 | 0.065090 | 0.094450 | 1.303500e-01 | 3.454000e-01 |
| concavity_mean | 571.0 | 8.871189e-02 | 7.945879e-02 | 0.000000e+00 | 0.029570 | 0.061810 | 1.282500e-01 | 4.268000e-01 |
| concave points_mean | 571.0 | -3.500369e+00 | 5.907334e+01 | -9.990000e+02 | 0.019420 | 0.032640 | 7.052500e-02 | 2.012000e-01 |
| symmetry_mean | 571.0 | 1.874016e-01 | 1.147049e-01 | 7.000000e-04 | 0.161900 | 0.179300 | 1.956500e-01 | 2.100000e+00 |
| fractal_dimension_mean | 571.0 | -1.403337e+01 | 1.175208e+02 | -9.990000e+02 | 0.057470 | 0.061400 | 6.600500e-02 | 9.744000e-02 |
| radius_se | 571.0 | 4.057996e-01 | 2.766729e-01 | 1.115000e-01 | 0.234100 | 0.326500 | 4.759500e-01 | 2.873000e+00 |
| texture_se | 571.0 | 1.220584e+00 | 5.485633e-01 | 3.602000e-01 | 0.846600 | 1.142000 | 1.472000e+00 | 4.885000e+00 |
| perimeter_se | 571.0 | 2.868202e+00 | 2.020423e+00 | 7.570000e-01 | 1.609000 | 2.289000 | 3.343500e+00 | 2.198000e+01 |
| area_se | 571.0 | 4.035017e+01 | 4.548431e+01 | 2.100000e+00 | 17.830000 | 24.620000 | 4.493500e+01 | 5.422000e+02 |
| smoothness_se | 571.0 | 7.026232e-03 | 2.982128e-03 | 1.713000e-03 | 0.005213 | 0.006399 | 8.077500e-03 | 3.113000e-02 |
| compactness_se | 571.0 | 2.534154e-02 | 1.772102e-02 | 2.252000e-03 | 0.013115 | 0.020520 | 3.206500e-02 | 1.354000e-01 |
| concavity_se | 571.0 | 3.189976e-02 | 3.012063e-02 | 0.000000e+00 | 0.015100 | 0.026110 | 4.161500e-02 | 3.960000e-01 |
| concave points_se | 571.0 | 1.178110e-02 | 6.164845e-03 | 0.000000e+00 | 0.007654 | 0.011090 | 1.464500e-02 | 5.279000e-02 |
| symmetry_se | 571.0 | 2.059438e-02 | 8.242557e-03 | 7.882000e-03 | 0.015200 | 0.018790 | 2.348500e-02 | 7.895000e-02 |
| fractal_dimension_se | 571.0 | 6.006402e-03 | 5.016246e-02 | 2.000000e-07 | 0.002253 | 0.003230 | 4.596500e-03 | 1.200000e+00 |
| radius_worst | 571.0 | 1.626979e+01 | 4.786831e+00 | 7.930000e+00 | 13.055000 | 15.050000 | 1.853000e+01 | 3.604000e+01 |
| texture_worst | 571.0 | 2.573569e+01 | 6.009911e+00 | 1.202000e+01 | 21.400000 | 25.590000 | 2.941000e+01 | 4.954000e+01 |
| perimeter_worst | 571.0 | 1.109480e+02 | 5.893305e+01 | 5.041000e+01 | 84.385000 | 98.270000 | 1.265000e+02 | 9.100000e+02 |
| area_worst | 571.0 | 8.979365e+02 | 6.858120e+02 | 1.852000e+02 | 515.850000 | 688.600000 | 1.086000e+03 | 1.005600e+04 |
| smoothness_worst | 571.0 | 1.324685e-01 | 2.268378e-02 | 7.117000e-02 | 0.117150 | 0.131600 | 1.458000e-01 | 2.226000e-01 |
| compactness_worst | 571.0 | 2.544117e-01 | 1.570279e-01 | 2.729000e-02 | 0.147750 | 0.215800 | 3.381000e-01 | 1.058000e+00 |
| concavity_worst | 571.0 | 2.721248e-01 | 2.083164e-01 | 0.000000e+00 | 0.115450 | 0.229800 | 3.819000e-01 | 1.252000e+00 |
| concave points_worst | 571.0 | 1.144705e-01 | 6.550680e-02 | 0.000000e+00 | 0.064985 | 0.101200 | 1.611000e-01 | 2.910000e-01 |
| symmetry_worst | 571.0 | 2.903266e-01 | 6.168968e-02 | 1.565000e-01 | 0.250550 | 0.282700 | 3.181500e-01 | 6.638000e-01 |
| fractal_dimension_worst | 571.0 | 8.401952e-02 | 1.796254e-02 | 5.504000e-02 | 0.071835 | 0.080200 | 9.195000e-02 | 2.075000e-01 |
data.isnull().sum()
id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64
# Encoding categorical features
label_encoder = LabelEncoder()
train_data["diagnosis"] = label_encoder.fit_transform(train_data["diagnosis"])
test_data["diagnosis"] = label_encoder.transform(test_data["diagnosis"])
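`LabelEncoder` assigns codes in sorted order of the class labels, so B (benign) becomes 0 and M (malignant) becomes 1, matching the plot title used later. A quick sketch with toy labels to confirm the mapping:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["B", "M", "B", "M", "M"])

# classes_ is sorted alphabetically, so B -> 0 and M -> 1.
mapping = {c: int(v) for c, v in zip(le.classes_, le.transform(le.classes_))}
print(mapping)
```

Checking `classes_` like this avoids silently inverting the positive class when computing recall or precision downstream.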
# Feature engineering
feature_selector = SelectKBest(f_classif, k=10)
selected_features = feature_selector.fit_transform(train_data.drop(["diagnosis", "id"], axis=1), train_data["diagnosis"])
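`SelectKBest` returns a bare array, so it is worth recovering which columns were kept via `get_support()`. A self-contained sketch on synthetic data (two informative columns, two noise columns, all hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)

# Two informative columns and two pure-noise columns (synthetic stand-ins).
X = pd.DataFrame({
    "informative_a": y + rng.normal(0, 0.1, 200),
    "noise_a": rng.normal(0, 1, 200),
    "informative_b": 2 * y + rng.normal(0, 0.1, 200),
    "noise_b": rng.normal(0, 1, 200),
})

selector = SelectKBest(f_classif, k=2).fit(X, y)
# get_support() is a boolean mask over the input columns.
chosen = X.columns[selector.get_support()].tolist()
print(chosen)
```

The F-test scores the class-separating columns far above the noise columns, so only the informative pair survives.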
data.hist(figsize = (20,20))
# Visualization and exploratory analysis
# Distribution of target variable
plt.figure(figsize=(8, 5))
sns.countplot(x="diagnosis", data=train_data)
plt.title("Distribution of Diagnosis (Malignant=1, Benign=0)")
plt.show()
# Calculating probability
data['diagnosis'].value_counts()/len(data)*100
diagnosis
B    62.872154
M    37.127846
Name: count, dtype: float64
data['diagnosis'].value_counts().plot.pie(startangle=50, autopct='%1.1f%%')
plt.show()
# Correlation heatmap
# The non-numeric 'diagnosis' column is excluded from the correlation matrix
plt.figure(figsize=(20, 15))
numeric_columns = data.select_dtypes(include=['number']).columns
numeric_data = data[numeric_columns]
# Heatmap
sns.heatmap(numeric_data.corr(), annot=True)
plt.show()
# Pairplot for selected features
selected_features_df = pd.DataFrame(selected_features, columns=train_data.columns[2:][feature_selector.get_support()])
# Use .values to avoid index misalignment between the new default-indexed frame and train_data
selected_features_df["diagnosis"] = train_data["diagnosis"].values
sns.pairplot(selected_features_df, hue="diagnosis", diag_kind="kde")
plt.suptitle("Pairplot for Selected Features", y=1.02)
plt.show()
# Assessing statistical assumptions and inferences
# Shapiro-Wilk test for normality assumption
for feature in train_data.columns[2:]:
    stat, p_value = shapiro(train_data[feature])
    print(f"Shapiro-Wilk test for {feature}: p-value = {p_value}")
Shapiro-Wilk test for radius_mean: p-value = 1.7309971164780613e-11
Shapiro-Wilk test for texture_mean: p-value = 2.1936598324339096e-31
Shapiro-Wilk test for perimeter_mean: p-value = 6.742514099128405e-12
Shapiro-Wilk test for area_mean: p-value = 1.2496485520357056e-17
Shapiro-Wilk test for smoothness_mean: p-value = 0.529057502746582
Shapiro-Wilk test for compactness_mean: p-value = 6.072096384729386e-12
Shapiro-Wilk test for concavity_mean: p-value = 1.0943981992391801e-16
Shapiro-Wilk test for concave points_mean: p-value = 1.0337378771324176e-41
Shapiro-Wilk test for symmetry_mean: p-value = 1.109789147388254e-39
Shapiro-Wilk test for fractal_dimension_mean: p-value = 1.0439673559219887e-41
Shapiro-Wilk test for radius_se: p-value = 1.522855340464036e-20
Shapiro-Wilk test for texture_se: p-value = 9.386578320169647e-11
Shapiro-Wilk test for perimeter_se: p-value = 1.2664812109195955e-20
Shapiro-Wilk test for area_se: p-value = 1.6546661938016331e-25
Shapiro-Wilk test for smoothness_se: p-value = 2.5943174917049702e-17
Shapiro-Wilk test for compactness_se: p-value = 3.939458834222064e-18
Shapiro-Wilk test for concavity_se: p-value = 8.416581923317268e-17
Shapiro-Wilk test for concave points_se: p-value = 5.649330447887735e-10
Shapiro-Wilk test for symmetry_se: p-value = 2.79441463255114e-17
Shapiro-Wilk test for fractal_dimension_se: p-value = 8.895442651533939e-42
Shapiro-Wilk test for radius_worst: p-value = 7.0335532238946855e-15
Shapiro-Wilk test for texture_worst: p-value = 0.00019242154667153955
Shapiro-Wilk test for perimeter_worst: p-value = 3.814766843110038e-34
Shapiro-Wilk test for area_worst: p-value = 3.872731402230384e-30
Shapiro-Wilk test for smoothness_worst: p-value = 0.012931867502629757
Shapiro-Wilk test for compactness_worst: p-value = 5.12819030201229e-15
Shapiro-Wilk test for concavity_worst: p-value = 2.7120848427285293e-13
Shapiro-Wilk test for concave points_worst: p-value = 1.0418825802105403e-08
Shapiro-Wilk test for symmetry_worst: p-value = 6.156057109386881e-13
Shapiro-Wilk test for fractal_dimension_worst: p-value = 1.9552446921462015e-15
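With 30 features tested at once, some small p-values are expected by chance alone, so a multiple-testing correction is prudent before declaring non-normality. A minimal Bonferroni sketch (using three of the p-values reported above; the threshold simply divides alpha by the number of tests):

```python
# Bonferroni: compare each p-value against alpha / number_of_tests.
alpha = 0.05
n_tests = 30  # one Shapiro-Wilk test per feature
p_values = {
    "smoothness_mean": 0.529,      # from the output above
    "texture_worst": 0.000192,
    "radius_mean": 1.73e-11,
}
threshold = alpha / n_tests
significant = {feature: p < threshold for feature, p in p_values.items()}
print(significant)
```

Even at the corrected threshold of roughly 0.0017, almost all features still reject normality, which supports the use of the non-parametric Mann-Whitney U test below.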
# Mann-Whitney U test for comparing distributions of benign and malignant cases
for feature in train_data.columns[2:]:
    stat, p_value = mannwhitneyu(train_data[train_data["diagnosis"] == 0][feature],
                                 train_data[train_data["diagnosis"] == 1][feature])
    print(f"Mann-Whitney U test for {feature}: p-value = {p_value}")
Mann-Whitney U test for radius_mean: p-value = 9.606217890757195e-49
Mann-Whitney U test for texture_mean: p-value = 5.297667926666895e-09
Mann-Whitney U test for perimeter_mean: p-value = 5.433245148759482e-51
Mann-Whitney U test for area_mean: p-value = 5.707545268513147e-49
Mann-Whitney U test for smoothness_mean: p-value = 3.896749542921967e-14
Mann-Whitney U test for compactness_mean: p-value = 9.842713126157597e-37
Mann-Whitney U test for concavity_mean: p-value = 9.038495525779925e-53
Mann-Whitney U test for concave points_mean: p-value = 1.2189396313054475e-55
Mann-Whitney U test for symmetry_mean: p-value = 4.760281492066189e-13
Mann-Whitney U test for fractal_dimension_mean: p-value = 0.7282519569708485
Mann-Whitney U test for radius_se: p-value = 5.606939812687955e-34
Mann-Whitney U test for texture_se: p-value = 0.5311141422265111
Mann-Whitney U test for perimeter_se: p-value = 6.903210913452115e-36
Mann-Whitney U test for area_se: p-value = 2.5797463849433016e-46
Mann-Whitney U test for smoothness_se: p-value = 0.14405996678951452
Mann-Whitney U test for compactness_se: p-value = 2.620399330772149e-15
Mann-Whitney U test for concavity_se: p-value = 1.236567113098441e-21
Mann-Whitney U test for concave points_se: p-value = 3.0073442359215313e-22
Mann-Whitney U test for symmetry_se: p-value = 0.02925017549776536
Mann-Whitney U test for fractal_dimension_se: p-value = 0.00010567945074825388
Mann-Whitney U test for radius_worst: p-value = 1.0763416655657176e-56
Mann-Whitney U test for texture_worst: p-value = 4.987782267252415e-21
Mann-Whitney U test for perimeter_worst: p-value = 9.973067727109487e-57
Mann-Whitney U test for area_worst: p-value = 2.15120622802156e-56
Mann-Whitney U test for smoothness_worst: p-value = 2.4934235102016866e-20
Mann-Whitney U test for compactness_worst: p-value = 1.6546556386962493e-37
Mann-Whitney U test for concavity_worst: p-value = 1.0073710431634533e-49
Mann-Whitney U test for concave points_worst: p-value = 9.314393496618401e-59
Mann-Whitney U test for symmetry_worst: p-value = 3.9936719114576526e-19
Mann-Whitney U test for fractal_dimension_worst: p-value = 9.596718603393983e-13
# Clustering using K-means
X_cluster = train_data.drop(["diagnosis", "id"], axis=1)
# Standardize the features
scaler = StandardScaler()
X_cluster_scaled = scaler.fit_transform(X_cluster)
# Determine the optimal number of clusters using the Elbow method
inertia = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X_cluster_scaled)
    inertia.append(kmeans.inertia_)
# Plot the Elbow method
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
# Based on the Elbow method, let's choose K=2
kmeans = KMeans(n_clusters=2, random_state=42)
train_data['kmeans_cluster'] = kmeans.fit_predict(X_cluster_scaled)
# Evaluate the clustering using silhouette score
silhouette_avg = silhouette_score(X_cluster_scaled, train_data['kmeans_cluster'])
print(f"Silhouette Score for K-means: {silhouette_avg}")
Silhouette Score for K-means: 0.32166615571611196
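Since ground-truth diagnosis labels exist here, the clusters can also be checked against them with a label-permutation-invariant score such as the adjusted Rand index. A self-contained sketch on two well-separated synthetic blobs (standing in for benign/malignant groups):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)
# Two well-separated 2D blobs as synthetic stand-ins for the two classes.
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels_true = np.array([0] * 50 + [1] * 50)

pred = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
# ARI ignores which integer K-means assigned to which blob.
ari = adjusted_rand_score(labels_true, pred)
print(round(ari, 3))
```

On the real data, an ARI well below 1 alongside the modest silhouette score above would suggest that K=2 clusters only partially recover the benign/malignant split.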
# Dimensionality reduction using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_cluster_scaled)
# Visualize the clusters in 2D using PCA
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=train_data['kmeans_cluster'], palette='viridis', legend='full')
plt.title('Clustering Visualization using PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
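Before trusting the 2D view, it is worth checking how much variance the two principal components actually retain via `explained_variance_ratio_`. A minimal sketch on synthetic correlated features (one dominant direction plus small noise):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated synthetic features: one dominant direction plus small noise.
base = rng.normal(0, 1, (200, 1))
X = np.hstack([base,
               2 * base + rng.normal(0, 0.1, (200, 1)),
               rng.normal(0, 0.1, (200, 1))])

pca = PCA(n_components=2).fit(X)
# Fraction of total variance captured by each component.
print(pca.explained_variance_ratio_.round(3))
```

If the first two components of the scaled tumour features captured only a small fraction of the variance, the scatter plot above would be a lossy summary of the cluster structure.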
# Clustering using Hierarchical (Agglomerative) Clustering
agg_clustering = AgglomerativeClustering(n_clusters=2)
train_data['agg_cluster'] = agg_clustering.fit_predict(X_cluster_scaled)
# Evaluate the clustering using silhouette score
silhouette_avg_agg = silhouette_score(X_cluster_scaled, train_data['agg_cluster'])
print(f"Silhouette Score for Agglomerative Clustering: {silhouette_avg_agg}")
Silhouette Score for Agglomerative Clustering: 0.30262020886873026
# Clustering using DBSCAN
dbscan = DBSCAN(eps=1.5, min_samples=5)
train_data['dbscan_cluster'] = dbscan.fit_predict(X_cluster_scaled)
# Visualize the clusters in 2D using PCA for DBSCAN
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=train_data['dbscan_cluster'], palette='viridis', legend='full')
plt.title('Clustering Visualization using PCA for DBSCAN')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
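Unlike K-means, DBSCAN labels points it considers noise as -1, so counting clusters and noise points is a useful sanity check on the `eps` and `min_samples` choices. A self-contained sketch with two dense synthetic blobs and one deliberate outlier:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense blobs plus one far-away point that should come out as noise.
X = np.vstack([rng.normal(0, 0.2, (30, 2)),
               rng.normal(4, 0.2, (30, 2)),
               [[100.0, 100.0]]])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
# -1 marks noise; it is not a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```

On the real scaled features, a large noise count would signal that `eps=1.5` is too tight for the 30-dimensional space, while a single giant cluster would signal it is too loose.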
# Section 3: Supervised machine learning analysis
# Classification
# Drop the id column and the cluster labels added earlier so only the original measurements are used as features
X_classify = train_data.drop(["diagnosis", "id", "kmeans_cluster", "agg_cluster", "dbscan_cluster"], axis=1)
y_classify = train_data["diagnosis"]
# Split the data into training and testing sets
X_train_classify, X_test_classify, y_train_classify, y_test_classify = train_test_split(
    X_classify, y_classify, test_size=0.2, random_state=42
)
# Train a classification model (Random Forest)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_classify, y_train_classify)
# Predictions
y_pred_classify = clf.predict(X_test_classify)
# Evaluation for classification
accuracy_classify = clf.score(X_test_classify, y_test_classify)
print(f"Accuracy for Classification: {accuracy_classify}")
print("Classification Report:")
print(classification_report(y_test_classify, y_pred_classify))
Accuracy for Classification: 0.9770114942528736
Classification Report:
precision recall f1-score support
0 0.97 1.00 0.98 58
1 1.00 0.93 0.96 29
accuracy 0.98 87
macro avg 0.98 0.97 0.97 87
weighted avg 0.98 0.98 0.98 87
# Confusion matrix
conf_matrix = confusion_matrix(y_test_classify, y_pred_classify)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix for Classification")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
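`cross_val_score` is imported at the top of the notebook but never used; a single held-out split can be optimistic, so k-fold cross-validation gives a steadier estimate. A minimal sketch on synthetic separable features (hypothetical stand-ins for the tumour measurements):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
# Two synthetic, clearly class-separated features.
X = np.column_stack([y + rng.normal(0, 0.2, 200),
                     2 * y + rng.normal(0, 0.2, 200)])

clf = RandomForestClassifier(random_state=42)
# 5-fold cross-validated accuracy; mean and spread summarize stability.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean().round(3), scores.std().round(3))
```

Applied to `X_classify`/`y_classify`, the mean and standard deviation of the fold scores would show whether the 97.7% single-split accuracy above is stable.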
# Regression
# Choose a numerical feature to predict
feature_to_predict = "radius_mean"
# Drop the target itself (and the cluster labels added earlier), otherwise the
# model can read the answer directly off its own input features
X_reg = train_data.drop(["diagnosis", "id", feature_to_predict,
                         "kmeans_cluster", "agg_cluster", "dbscan_cluster"], axis=1)
y_reg = train_data[feature_to_predict]
# Split the data into training and testing sets
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)
# Train a regression model (Random Forest)
regressor = RandomForestRegressor(random_state=42)
regressor.fit(X_train_reg, y_train_reg)
# Predictions
y_pred_reg = regressor.predict(X_test_reg)
# Evaluation for regression
mse_reg = mean_squared_error(y_test_reg, y_pred_reg)
print(f"Mean Squared Error for Regression: {mse_reg}")
Mean Squared Error for Regression: 0.03412270495902717
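Raw MSE is hard to interpret on its own; taking the square root puts the error back in the units of `radius_mean`, and comparing against a mean-predicting baseline shows whether the model adds value. A minimal sketch with small hypothetical numbers (not the model's actual predictions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([12.0, 14.5, 13.2, 18.9, 11.1])
y_pred = np.array([12.2, 14.1, 13.5, 18.2, 11.4])  # hypothetical predictions

mse = mean_squared_error(y_true, y_pred)
rmse = float(np.sqrt(mse))  # same units as radius_mean
# Baseline: always predict the mean of the true values.
baseline_mse = mean_squared_error(y_true, np.full_like(y_true, y_true.mean()))
print(round(rmse, 3), mse < baseline_mse)
```

The same comparison on `y_test_reg`/`y_pred_reg` would show how far the Random Forest improves on simply predicting the average radius.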
# Visualization for regression
plt.figure(figsize=(10, 6))
plt.scatter(X_test_reg.index, y_test_reg, label="Actual", alpha=0.7)
plt.scatter(X_test_reg.index, y_pred_reg, label="Predicted", alpha=0.7)
plt.title(f"Regression - Actual vs Predicted ({feature_to_predict})")
plt.xlabel("Sample Index")
plt.ylabel(feature_to_predict)
plt.legend()
plt.show()
Reflecting on the Wisconsin Breast Cancer dataset analysis, the process has been both illuminating and complex. During pre-processing, missing values, outliers, and feature engineering were handled carefully to maintain data integrity. Exploratory Data Analysis revealed intricate correlations and patterns in the dataset. In unsupervised machine learning, clustering and dimensionality reduction exposed latent structure in the data, helping identify important groups and attributes. The supervised learning phase added further depth, with classification and regression models classifying and predicting breast cancer instances. Throughout the analysis, the importance of model performance and of ethical issues, particularly in healthcare, became increasingly apparent. This approach has shown that data science is iterative: insights from each phase feed future choices and deepen comprehension of the data and its implications.
This analysis relies on the Wisconsin Breast Cancer dataset, which may not capture the full variability of breast cancer. As a static dataset collected at a particular place and time, models trained on it may not generalize to different populations. The dataset's modest size may also limit the models' capacity to detect subtle patterns in breast cancer trends. These limitations emphasize the need to assess the dataset's representativeness and breadth before extending results to larger contexts or periods. Further research might use larger, more varied datasets to improve machine learning models for breast cancer detection and diagnosis.
Machine learning models in healthcare, especially for breast cancer detection, raise ethical concerns that must be considered. Handling and protecting sensitive medical data securely is crucial to patient privacy and data confidentiality. Transparent communication and informed consent from dataset contributors are essential to upholding ethical values. Biases in historical medical data, such as unequal healthcare access and demographic representation, should be identified and managed so that algorithms do not perpetuate existing inequalities. To maximize the technology's benefits and minimize harm, healthcare machine learning requires continual monitoring, multidisciplinary cooperation, and ethical compliance.
In conclusion, our extensive analysis of the Wisconsin Breast Cancer dataset has illuminated the use of machine learning for breast cancer detection and diagnosis. Preprocessing and exploratory data analysis provided the groundwork for understanding the dataset, while unsupervised machine learning revealed underlying patterns. In the supervised learning phase, classification and regression experiments demonstrated the models' capacity to distinguish benign from malignant instances and to predict key numerical characteristics. Throughout, the analysis has prioritized patient privacy, transparency, and bias reduction.
Moving forward, the analysis's limitations, namely the dataset's fixed form and the biases present in historical medical data, must be acknowledged. Future research should use larger and more varied datasets to improve model generalizability. Responsible deployment of healthcare technology requires collaboration between healthcare practitioners, data scientists, and ethicists. As machine learning becomes integral to medical diagnosis, ethical norms, transparency, and continual improvement are essential to maximize its promise in healthcare while reducing hazards.